PyBroMo - A1. Reference - Data format and internals

This notebook is part of PyBroMo a python-based single-molecule Brownian motion diffusion simulator that simulates confocal smFRET experiments.

Summary. This notebook describes the HDF5 files used to save diffusion trajectories and raw timestamps. PyBroMo can also generate complete smFRET data files in Photon-HDF5 format, but for this format the reader can refer to official Photon-HDF5 specifications.


In [1]:
import numpy as np
import tables
import pybromo as pbm
print('Numpy version:', np.__version__)
print('PyTables version:', tables.__version__)
print('PyBroMo version:', pbm.__version__)


Numpy version: 1.10.1
PyTables version: 3.2.2
PyBroMo version: 0.6
/Users/anto/miniconda3/lib/python3.5/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

In [2]:
S = pbm.ParticlesSimulation.from_datafile('0168')


 - Found matching timestamps.

File format for trajectories

The simulation is saved in a HDF5 file, one for each running engine. The file has the following content:

  • **/parameters**
    • Numeric parameters (storead as scalar arrays)

      • D
      • t_step
      • t_max
      • EID
      • ID
      • chunksize: used for emission and position arrays
      • np: number of particles
      • pMol: particles concentration in pico-Molar
    • Non-numeric parameters (stored as group attributes)

      • box: the Box() object (stores boundaries and size)
      • particles: the Particles() object, a list of Particle() (containing the initial position positions) and seed.
  • **/psf**
    • Arrays of PSF used in the simulation. This is the raw array as saved from PSFLab. The name of the array is the same as the origina file name. The PSF array has the following attributes:
      • 'fname', 'dir_', 'x_step', 'z_step'
        • The array and its attributes allow to recreate the NumericPSF() object on a simulation reload.
      • TODO: 'wavelength', 'NA', and other PSFLab parameters
    • default_psf: hard link to the PSF used in the simualation, used as persistent name
  • **/trajectories**
    • emission: 2D array of emission traces: one row per particle. Shape: (num_particles, time)
    • emission_tot: 1D array of emission trace: total emission from all the particles: Shape: (time)
    • position: 3D array of positions. Shape (num_particles, 3, time)
  • **/timestamps**
    • Arrays of timestamps for different rate_max, bg_rate and seed.

How to access

The HDF5 file handle is in S.store.data_file after you run S.open_store():


In [3]:
S.compact_name_core()


Out[3]:
'016898_P20_D1.2e-11_P15_D6e-12_75pM_step0.5us'

In [4]:
S.compact_name_core(t_max=True)


Out[4]:
'016898_P20_D1.2e-11_P15_D6e-12_75pM_step0.5us_t_max1.0s'

In [5]:
S.compact_name()


Out[5]:
'016898_P20_D1.2e-11_P15_D6e-12_75pM_step0.5us_t_max1.0s_ID0-0'

HDF5 File inspection


In [6]:
print ('Main groups:\n')
for node in S.store.h5file.root:
    print (node)
    for n in node:
        print ('\t%s' % n.name)
        print ('\t    %s' % n.title)


Main groups:

/parameters (Group) 'Simulation parameters'
	D
	    Diffusion coefficient (m^2/s)
	EID
	    IPython Engine ID (int)
	ID
	    Simulation ID (int)
	chunksize
	    Chunksize for arrays
	np
	    Number of simulated particles
	pico_mol
	    Particles concentration (pM)
	t_max
	    Simulation total time (s)
	t_step
	    Simulation time-step (s)
/psf (Group) 'PSFs used in the simulation'
	default_psf
	    PSF x-z slice (PSFLab array)
	xz_realistic_z50_150_160_580nm_n1335_HR2
	    PSF x-z slice (PSFLab array)
/trajectories (Group) 'Simulated trajectories'
	emission
	    Emission trace of each particle
	emission_tot
	    Summed emission trace of all the particles
	position
	    X-Y-Z position trace of each particle

In [7]:
h5file = S.store.h5file

Group: /parameters


In [8]:
group = '/parameters'

print ('Numeric attributes (nodes) in %s:\n' % group)

print (S.store.h5file.get_node(group))
for node in S.store.h5file.get_node(group)._f_list_nodes():
    print ('\t%s (%s)' % (node.name, str(node.read())))
    print ('\t    %s' % node.title)


Numeric attributes (nodes) in /parameters:

/parameters (Group) 'Simulation parameters'
	D (9.428571428571427e-12)
	    Diffusion coefficient (m^2/s)
	EID (0)
	    IPython Engine ID (int)
	ID (0)
	    Simulation ID (int)
	chunksize (524288)
	    Chunksize for arrays
	np (35)
	    Number of simulated particles
	pico_mol (75.67560551416292)
	    Particles concentration (pM)
	t_max (1)
	    Simulation total time (s)
	t_step (5e-07)
	    Simulation time-step (s)

In [9]:
group = '/parameters'

print ('Attributes in %s:\n' % group)

print (S.store.h5file.get_node(group))
for attr in S.store.h5file.get_node(group)._v_attrs._f_list():
    print ('\t%s' % attr)
    print ('\t    %s' % type(S.store.h5file.get_node(group)._v_attrs[attr]))


Attributes in /parameters:

/parameters (Group) 'Simulation parameters'
	box
	    <class 'pybromo.diffusion.Box'>
	particles
	    <class 'pybromo.diffusion.Particles'>

In [10]:
particles = S.store.h5file.root.parameters._f_getattr('particles')

In [11]:
len(particles), particles.rs_hash


Out[11]:
(35, 'bfb')

Group /psf


In [12]:
group = '/psf'

print ('Nodes in in %s:\n' % group)

print (S.store.h5file.get_node(group))
for node in S.store.h5file.get_node(group)._f_list_nodes():
    print ('\t%s %s' % (node.name, node.shape))
    print ('\t    %s' % node.title)


Nodes in in /psf:

/psf (Group) 'PSFs used in the simulation'
	default_psf (193, 129)
	    PSF x-z slice (PSFLab array)
	xz_realistic_z50_150_160_580nm_n1335_HR2 (193, 129)
	    PSF x-z slice (PSFLab array)

default_psf attributes


In [13]:
node_name = '/psf/default_psf'
node = S.store.h5file.get_node(node_name)

print ("\n%s %s: '%s'" % (node.name, node.shape, node.title))
print ('\n    List of attributes:')
for attr in node.attrs._f_list():
    print ('\t%s' % attr)
    print ("\t    %s" % repr(node._v_attrs[attr]))


default_psf (193, 129): 'PSF x-z slice (PSFLab array)'

    List of attributes:
	dir_
	    '/Users/anto/src/PyBroMo/pybromo/psf_data'
	fname
	    'xz_realistic_z50_150_160_580nm_n1335_HR2'
	x_step
	    0.0625
	z_step
	    0.0625

Group /trajectories


In [14]:
group = '/trajectories'

print ('Nodes in in %s:\n' % group)

print (S.store.h5file.get_node(group))
for node in S.store.h5file.get_node(group)._f_list_nodes():
    print ('\t%s %s' % (node.name, node.shape))
    print ('\t    %s' % node.title)


Nodes in in /trajectories:

/trajectories (Group) 'Simulated trajectories'
	emission (35, 2000000)
	    Emission trace of each particle
	emission_tot (0,)
	    Summed emission trace of all the particles
	position (35, 3, 2000000)
	    X-Y-Z position trace of each particle

In [15]:
group = '/trajectories'

print ('Attributes in %s:\n' % group)

print (S.store.h5file.get_node(group))
for attr in S.store.h5file.get_node(group)._v_attrs._f_list():
    print ('\t%s' % attr)
    print ('\t    %s' % type(S.store.h5file.get_node(group)._v_attrs[attr]))


Attributes in /trajectories:

/trajectories (Group) 'Simulated trajectories'
	init_random_state
	    <class 'tuple'>
	last_random_state
	    <class 'tuple'>
	psf_name
	    <class 'numpy.str_'>

emission attributes


In [16]:
node_name = '/trajectories/emission'
node = S.store.h5file.get_node(node_name)

print ("\n%s %s: '%s'" % (node.name, node.shape, node.title))
print ('\n    List of attributes:')
for attr in node.attrs._f_list():
    print ('\t%s' % attr)
    attr_data = repr(node._v_attrs[attr])
    if len(attr_data) > 300:
        attr_data = hash_(node._v_attrs[attr])
    print ("\t    %s" % attr_data)


emission (35, 2000000): 'Emission trace of each particle'

    List of attributes:
	PyBroMo
	    '0.5+43.g13dd56c'
	creation_time
	    '2015-09-22 19:36:56'

position attributes


In [17]:
node_name = '/trajectories/position'
if node_name in S.store.h5file:
    node = S.store.h5file.get_node(node_name)
    
    print ("\n%s %s: '%s'" % (node.name, node.shape, node.title))
    print ('\n    List of attributes:')
    for attr in node.attrs._f_list():
        print ('\t%s' % attr)
        print ("\t    %s" % repr(node._v_attrs[attr]))
else:
    print ('%s not present.')


position (35, 3, 2000000): 'X-Y-Z position trace of each particle'

    List of attributes:
	PyBroMo
	    '0.5+43.g13dd56c'
	creation_time
	    '2015-09-22 19:36:56'

File format for timestamps


In [18]:
group = '/timestamps'

print ('Nodes in in %s:\n' % group)

print (S.ts_store.h5file.get_node(group))
for node in S.ts_store.h5file.get_node(group)._f_list_nodes():
    print ('\t%s' % node.name)
    print ('\t    %s' % node.title)


Nodes in in /timestamps:

/timestamps (Group) 'Simulated timestamps'
	Pop1_P20_Pstart0_max_rate150000cps_BG800cps_Pop2_P15_Pstart20_max_rate63000cps_BG800cps_t_1s_rs_0eb3bf
	    Simulated photon timestamps
	Pop1_P20_Pstart0_max_rate150000cps_BG800cps_Pop2_P15_Pstart20_max_rate63000cps_BG800cps_t_1s_rs_0eb3bf_par
	    Particle number for each timestamp
	Pop1_P20_Pstart0_max_rate150000cps_BG800cps_t_1s_rs_9b9a03
	    Simulated photon timestamps
	Pop1_P20_Pstart0_max_rate150000cps_BG800cps_t_1s_rs_9b9a03_par
	    Particle number for each timestamp
	Pop1_P20_Pstart0_max_rate50000cps_BG1500cps_Pop2_P15_Pstart20_max_rate117000cps_BG1500cps_t_1s_rs_9832bb
	    Simulated photon timestamps
	Pop1_P20_Pstart0_max_rate50000cps_BG1500cps_Pop2_P15_Pstart20_max_rate117000cps_BG1500cps_t_1s_rs_9832bb_par
	    Particle number for each timestamp
	Pop1_P20_Pstart0_max_rate50000cps_BG1500cps_t_1s_rs_9832bb
	    Simulated photon timestamps
	Pop1_P20_Pstart0_max_rate50000cps_BG1500cps_t_1s_rs_9832bb_par
	    Particle number for each timestamp

In [19]:
group = '/timestamps'

print ('Attributes in %s:\n' % group)

print (S.ts_store.h5file.get_node(group))
for attr in S.ts_store.h5file.get_node(group)._v_attrs._f_list():
    print ('\t%s' % attr)
    print ('\t    %s' % type(S.ts_store.h5file.get_node(group)._v_attrs[attr]))


Attributes in /timestamps:

/timestamps (Group) 'Simulated timestamps'
	init_random_state
	    <class 'tuple'>
	last_random_state
	    <class 'tuple'>

In [20]:
group = '/timestamps'

print ('>> Nodes in %s' % S.ts_store.h5file.get_node(group))

for node in S.ts_store.h5file.get_node(group)._f_list_nodes():
    #print '\t%s' % node.name
    #print '\t    %s' % node.title
    print ("\n%s %s: '%s'" % (node.name, node.shape, node.title))
    print ('\n    List of attributes:')
    for attr in node.attrs._f_list():
        print ('\t%s' % attr)
        attr_data = repr(node._v_attrs[attr])
        if len(attr_data) > 300:
            attr_data = pbm.hash_(node._v_attrs[attr])
        print ("\t    %s" % attr_data)
        
print ('\n>> Attributes in %s:\n' % group)

print (S.ts_store.h5file.get_node(group))
for attr in S.ts_store.h5file.get_node(group)._v_attrs._f_list():
    print ('\t%s' % attr)
    print ('\t    %s' % type(S.ts_store.h5file.get_node(group)._v_attrs[attr]))


>> Nodes in /timestamps (Group) 'Simulated timestamps'

Pop1_P20_Pstart0_max_rate150000cps_BG800cps_Pop2_P15_Pstart20_max_rate63000cps_BG800cps_t_1s_rs_0eb3bf (1528,): 'Simulated photon timestamps'

    List of attributes:
	PyBroMo
	    '0.6'
	bg_rate
	    800
	clk_p
	    4.9999999999999998e-08
	creation_time
	    '2015-11-21 23:54:43'
	init_random_state
	    0eb3bf09731db4625a815c574b0eaca0139488e3
	last_random_state
	    4aebc456d3ae5f4b44461c225c50034dd9a196e0
	max_rates
	    [150000.0, 62999.999999999993]
	populations
	    [slice(0, 20, None), slice(20, 35, None)]

Pop1_P20_Pstart0_max_rate150000cps_BG800cps_Pop2_P15_Pstart20_max_rate63000cps_BG800cps_t_1s_rs_0eb3bf_par (1528,): 'Particle number for each timestamp'

    List of attributes:
	PyBroMo
	    '0.6'
	bg_particle
	    35
	creation_time
	    '2015-11-21 23:54:43'
	num_particles
	    35

Pop1_P20_Pstart0_max_rate150000cps_BG800cps_t_1s_rs_9b9a03 (1065,): 'Simulated photon timestamps'

    List of attributes:
	PyBroMo
	    '0.6'
	bg_rate
	    800
	clk_p
	    4.9999999999999998e-08
	creation_time
	    '2015-11-21 23:54:38'
	init_random_state
	    9b9a0390ffb50b993b12ea26dfbbd19c89c3aeb7
	last_random_state
	    0c3ae0c619e18a56783d70aa8e2eef2308d1561e
	max_rates
	    [150000.0]
	populations
	    [slice(0, 20, None)]

Pop1_P20_Pstart0_max_rate150000cps_BG800cps_t_1s_rs_9b9a03_par (1065,): 'Particle number for each timestamp'

    List of attributes:
	PyBroMo
	    '0.6'
	bg_particle
	    35
	creation_time
	    '2015-11-21 23:54:38'
	num_particles
	    35

Pop1_P20_Pstart0_max_rate50000cps_BG1500cps_Pop2_P15_Pstart20_max_rate117000cps_BG1500cps_t_1s_rs_9832bb (2277,): 'Simulated photon timestamps'

    List of attributes:
	PyBroMo
	    '0.6'
	bg_rate
	    1500
	clk_p
	    4.9999999999999998e-08
	creation_time
	    '2015-11-21 23:54:40'
	init_random_state
	    9832bb4fc1d8db4ac18ed0c9b6a22ec61fb2fc81
	last_random_state
	    0eb3bf09731db4625a815c574b0eaca0139488e3
	max_rates
	    [50000.0, 117000.0]
	populations
	    [slice(0, 20, None), slice(20, 35, None)]

Pop1_P20_Pstart0_max_rate50000cps_BG1500cps_Pop2_P15_Pstart20_max_rate117000cps_BG1500cps_t_1s_rs_9832bb_par (2277,): 'Particle number for each timestamp'

    List of attributes:
	PyBroMo
	    '0.6'
	bg_particle
	    35
	creation_time
	    '2015-11-21 23:54:40'
	num_particles
	    35

Pop1_P20_Pstart0_max_rate50000cps_BG1500cps_t_1s_rs_9832bb (1590,): 'Simulated photon timestamps'

    List of attributes:
	PyBroMo
	    '0.6'
	bg_rate
	    1500
	clk_p
	    4.9999999999999998e-08
	creation_time
	    '2015-11-21 23:54:36'
	init_random_state
	    9832bb4fc1d8db4ac18ed0c9b6a22ec61fb2fc81
	last_random_state
	    9b9a0390ffb50b993b12ea26dfbbd19c89c3aeb7
	max_rates
	    [50000.0]
	populations
	    [slice(0, 20, None)]

Pop1_P20_Pstart0_max_rate50000cps_BG1500cps_t_1s_rs_9832bb_par (1590,): 'Particle number for each timestamp'

    List of attributes:
	PyBroMo
	    '0.6'
	bg_particle
	    35
	creation_time
	    '2015-11-21 23:54:36'
	num_particles
	    35

>> Attributes in /timestamps:

/timestamps (Group) 'Simulated timestamps'
	init_random_state
	    <class 'tuple'>
	last_random_state
	    <class 'tuple'>

In [21]:
S.store.close()
S.ts_store.close()

Random State: reprodicibility and high-quality pseudo-random numbers

PyBromo uses Numpy's RandomState object to track the current state of the random number generator at different simulation stages.

Tracking the random state has a dual purpose. First, it allows to reproduce any simulation step.

Second, it allows to mantain an high-quality pseudo-random number stream when the simulation is resumed. This point is more subtle. Simulation can be performed in different steps. When resuming a simulation to proceed to the nex step it is important to use the last saved random state. In fact, by resetting the random state using an arbitrary seed we may easily introduce correlation between the previous and current stream of random numbers. For example streams generated with seeds 1 and 2 will be correlated (up to 1e6 samples!) because many bits in the seed are the same. This is a property of the Mersenne twister algorithm (see also this paper). To avoid this well-known problem we need to use a single stream by freezing (saving) and restoring the random state at each step.

Notice that there are 3 steps that require random numbers:

  1. Generating the initial particles position
  2. Brownian motion trajectories simulation (3-D trajectories + emission rates)
  3. Generating timestamps based on the emission rates

The random state is tracked as follows:

  1. When generating the particles with gen_particles the new Particles object will contain the .init_random_state attribute (and, as a convenience, a 3 digit hash in .rs_hash)
  2. Whem performing the Brownian motion simulation with .sim_brownian_motion, the '/trajectories' group (shortcut S.traj_group) will store the initial and the final state as group attributes: .init_random_state and .last_random_state. The assumption is that we simulate only 1 Browian motion diffusion per object so makes sense to store these values as group attributes.
  3. When generating timestamps with S.sim_timestamps_em_store, we will generate timestamps several times for different emission or background rates. Therefore the '/timestamps' group (shortcutS.ts_group) contains the last_random_state attribute and each timestamp array contains the init_random_state attribute.

NOTE: Since the random-state data structure (a tuple of a string, an array, and 3 numbers) is numpy-specific we save it without any conversion. In fact, using the same random state in another programming language would require a deep understanding of the Mersenne twister implementation. Not to mention that a proprietary language may not provide such level of details to the user. Anyway, if you are motivated to use the random state in another language, an additional conversion from numpy format would be the least of your problems.